Abstract: In this article, we study rates of convergence of the general-
ization error of multi-class margin classifiers. In particular, we develop an
upper bound theory quantifying the generalization error of various large 
margin classifiers. The theory permits a treatment of general margin losses, 
convex or nonconvex, in presence or absence of a dominating class. Three 
main results are established. First, for any fixed margin loss, there may be 
a trade-off between the ideal and actual generalization performances with 
respect to the choice of the class of candidate decision functions, which 
is governed by the trade-off between the approximation and estimation 
errors. In fact, different margin losses lead to different ideal or actual per- 
formances in specific cases. Second, we demonstrate, in a problem of linear 
learning, that the convergence rate can be arbitrarily fast in the sample 
size n depending on the joint distribution of the input/output pair. This 
goes beyond the anticipated rate $O(n^{-1})$. Third, we establish rates of con-
vergence of several margin classifiers in feature selection with the number
of candidate variables p allowed to greatly exceed the sample size n but no 
faster than exp(n). 
1. Introduction

Large margin classification has seen significant developments in the past several 
years, including many well-known classifiers such as Support Vector Machine 
(SVM, (0)) and Neural Networks. For margin classifiers, this article investigates 
their generalization accuracy in multi-class classification. 

In the literature, the generalization accuracy of large margin classifiers has 
been investigated in two-class classification. Relevant results can be found in,
for example, fSl), (29) and (14). For multi-class classification, however, there are
many distinct generalizations of the same two-class margin classifier; see Section 
[3] for a further discussion of this aspect. As a result, much less is known with 
regard to the generalization accuracy of large margin classifiers, particularly 
its relation to presence/absence of a dominating class, which is not of concern 
in the two-class case. Consistency has been studied in Q, and (HH). To our 
knowledge, rates of convergence of the generalization error have not been yet 
studied for general margin classifiers in multi-class classification. 

In the two-class case, the generalization accuracy of a large margin classifier
is studied through the notion of Fisher consistency (cf., (15); ((soi))), where the
Bayesian regret $\mathrm{Regret}(\hat f, \bar f)$ is used to measure the discrepancy
between an estimated decision function $\hat f$ and $\bar f$, the (global) Bayes decision
function over all possible candidate functions. When a specific class of candidate
decision functions $\mathcal{F}$ and a surrogate loss $V$ are used in classification,
$\bar f$ is often not the risk minimizer defined by $V$ over $\mathcal{F}$. Then an
approximation error of $\mathcal{F}$ to $\bar f$ with respect to $V$ is usually assumed
to yield an upper bound of $\mathrm{Regret}(\hat f, \bar f)$, expressed in terms of an
approximation error plus an estimation error of estimating the decision function.
One major difficulty with this formulation is that the approximation error may
dominate the corresponding estimation error and be non-zero. This occurs in
classification with linear decision functions; see Section 5.1 for an example. In
such a situation, well-established bounds for the estimation error become
irrelevant, and hence such a learning theory breaks down when the approximation
error does not tend to zero.

To treat the multi-class margin classification, and circumvent the aforementioned
difficulty, we take a novel approach by targeting $\mathrm{Regret}(\hat f, f^V)$ with
$f^V$ the risk minimizer over $\mathcal{F}$ given $V$. Toward this end, we study the
ideal generalization performance of $f^V$ and the mean-variance relationship of the
cost function. This permits a comparison of various margin classifiers with respect
to the ideal and actual performances respectively described in Sections 3 and 4,
bypassing the requirement of studying the Fisher consistency. As illustrated
in Section 5.2, the rate of convergence of the generalization error
of certain large margin classifiers can be arbitrarily fast in linear classification, 
depending on the joint distribution of the input/output pair. Moreover, in linear
classification, the ideal generalization performance is more crucial than the
actual generalization performance, whereas in nonlinear classification the
approximation error becomes important to the actual generalization performance.
Finally, we treat variable selection in sparse learning in a high-dimensional
situation. There the focus has been on how to utilize the sparseness structure
to attack the curse of high dimensionality, cf., (31) and (12). Our formulation
permits the number of candidate variables $p$ to greatly exceed the sample size
$n$. Specifically, we obtain results for several margin classifiers involving feature
selection, when $p$ grows no faster than $\exp(n)$. This illustrates the important
role of penalty in sparse learning.

This article is organized as follows. Section 2 introduces the notation of
generalized multi-class margin losses to unify various generalizations of two-class
margin losses. Section 3 discusses the ideal generalization performance of $f^V$ with
respect to $V$, whereas Section 4 establishes an upper bound theory concerning
the generalization error for margin classifiers. Section 5 illustrates the general
theory through four classification examples. The Appendix contains technical
proofs.



2. Multi-class and generalized margin losses 

In $k$-class classification, a decision function vector $f = (f_1, \dots, f_k)$, with $f_j$
representing class $j$, mapping from input space $\mathcal{X} \subset \mathbb{R}^d$ to
$\mathbb{R}$, is estimated through a training sample $Z_i = (X_i, Y_i)_{i=1}^n$, independent
and identically distributed according to an unknown joint probability $P(x, y)$, where
$Y_i$ is coded as $\{1, \dots, k\}$. For an instance $x$, classification is performed by
rule $\arg\max_{1 \le j \le k} f_j(x)$, assigning $x$ to a class with the highest value of
$f_j(x)$; $j = 1, \dots, k$. The classifier defined by $\arg\max_{1 \le j \le k} f_j(x)$
partitions $\mathcal{X}$ into $k$ disjoint and exhaustive regions
$\mathcal{X}_1, \dots, \mathcal{X}_k$. To avoid redundancy in $f$, a zero-sum constraint
$\sum_{j=1}^k f_j = 0$ is enforced. Note that $f_j$; $j = 1, \dots, k$, are not probabilities.

In multi-class margin classification, there are a number of generalizations of
the same two-class method. We now introduce a framework using the notion
of generalized margin, unifying various generalizations. Define the generalized
functional margin $u(f(x), y)$ as $(f_y(x) - f_1(x), \dots, f_y(x) - f_{y-1}(x),
f_y(x) - f_{y+1}(x), \dots, f_y(x) - f_k(x)) = (u_1, \dots, u_{k-1})$, comparing class $y$
against the remaining classes. When $k = 2$, it reduces to the binary functional margin
$f_y - f_{c \ne y}$, which, together with the zero-sum constraint, is equivalent to $yf(x)$
with $y = \pm 1$. Within this framework, we define a generalized margin loss

$$V(f, z) = h(u(f(x), y))$$

for some measurable function $h$ and $z = (x, y)$, where $V$ is called large margin if
it is nonincreasing with respect to each component of $u(f(x), y)$, and $V$ is often
called a surrogate loss when it is not the 0-1 loss. For import vector machine
(Hi),

$$h(u) = h_{logit}(u) = \log\Big(1 + \sum_{j=1}^{k-1} \exp(-u_j)\Big).$$

For multi-class SVMs, several versions of generalized hinge loss exist. The
generalized hinge loss proposed by lii), (0), (0), (0), and Inl) is defined by

$$h(u) = h_{svm1}(u) = \sum_{j=1}^{k-1} [1 - u_j]_+;$$

the generalized hinge loss in (flsh is defined by

$$h(u) = h_{svm2}(u) = \sum_{j=1}^{k-1} \Big[\frac{\sum_{c=1}^{k-1} u_c}{k} - u_j + 1\Big]_+;$$

the generalized hinge loss in (16) is defined by

$$h(u) = h_{svm3}(u) = \Big[1 - \min_{1 \le j \le k-1} u_j\Big]_+.$$

For multi-class $\psi$-learning, $h(u) = h_\psi(u) = \psi(\min_{1 \le j \le k-1} u_j)$ is the
generalized $\psi$-loss by (lid ), with $\psi(x) = 0$ for $x > 1$; $2$ for $x < 0$;
$2(1 - x)$ for $0 \le x \le 1$. For multi-class boosting ((32)),
$h(u) = (1 - \min_{1 \le j \le k-1} u_j)^2$ is a generalized squared loss. Interestingly,
for the 0-1 loss, $h(u) = L(f, Z) = I[\min_{1 \le j \le k-1} u_j < 0]$.
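The generalized margin and the loss functions $h$ of this section translate directly into code. The following is a minimal sketch in Python with NumPy, not the authors' implementation; the function names and the 0-based class index `y` are assumptions made for illustration.

```python
import numpy as np

def generalized_margin(f_x, y):
    """u(f(x), y): the k-1 differences f_y(x) - f_j(x), j != y (y is 0-based here)."""
    f_x = np.asarray(f_x, dtype=float)
    return f_x[y] - np.delete(f_x, y)

def h_logit(u):
    # log(1 + sum_j exp(-u_j))
    return np.log1p(np.sum(np.exp(-u)))

def h_svm1(u):
    # sum_j [1 - u_j]_+
    return np.sum(np.maximum(1.0 - u, 0.0))

def h_svm2(u):
    # sum_j [sum_c u_c / k - u_j + 1]_+, with k = len(u) + 1 classes
    k = u.size + 1
    return np.sum(np.maximum(np.sum(u) / k - u + 1.0, 0.0))

def h_svm3(u):
    # [1 - min_j u_j]_+
    return max(1.0 - np.min(u), 0.0)

def psi(x):
    # psi(x) = 0 for x > 1; 2 for x < 0; 2(1 - x) on [0, 1]
    if x > 1:
        return 0.0
    if x < 0:
        return 2.0
    return 2.0 * (1.0 - x)

def h_psi(u):
    return psi(np.min(u))

def h_01(u):
    # 0-1 loss: I[min_j u_j < 0]
    return float(np.min(u) < 0)
```

For a zero-sum vector such as `f_x = [1.0, -0.5, -0.5]`, calling `generalized_margin(f_x, 0)` gives the margin of class 1 against the other two classes, which can then be passed to any of the `h_*` functions.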

For classification, a penalized cost function is constructed through $V(f, Z)$:

$$n^{-1} \sum_{i=1}^n V(f, Z_i) + \lambda J(f) \qquad (2.1)$$

where $J(f)$ is a nonnegative penalty penalizing undesirable properties of $f$, and
$\lambda > 0$ is a tuning parameter controlling the trade-off between training and $J(f)$.
The minimizer of (2.1) with respect to
$f \in \mathcal{F} = \{(f_1, \dots, f_k) \in \mathcal{F} : \sum_{j=1}^k f_j = 0\}$,
a class of candidate decision function vectors, yields
$\hat f = (\hat f_1, \dots, \hat f_k)$, thus classifier $\arg\max_{j=1,\dots,k} \hat f_j$.

In classification, $J(f)$ is often the inverse of the geometric margin defined by
various norms or the conditional Fisher information (0). For instance, in linear
SVM classification with feature selection, the inverse geometric margin with
respect to a linear decision function vector $f$ is defined as
$\frac{1}{2} \sum_{j=1}^k \|w_j\|_1$, cf., (0), where $f_j(x) = \langle w_j, x \rangle + b_j$,
$j = 1, \dots, k$, with $\langle \cdot, \cdot \rangle$ the usual inner product in
$\mathbb{R}^d$, $b_j \in \mathbb{R}$, and $\|\cdot\|_1$ is the usual $L_1$ norm. In standard
kernel SVM learning, the inverse geometric margin becomes
$\frac{1}{2} \sum_{j=1}^k \|g_j\|_K^2 = \frac{1}{2} \sum_{j=1}^k \sum_{i=1}^n \sum_{l=1}^n \alpha_i^j \alpha_l^j K(x_i, x_l)$,
where $f_j$ has a kernel representation of
$g_j(x) + b_j = \sum_{i=1}^n \alpha_i^j K(x, x_i) + b_j$. Here $K(\cdot, \cdot)$ is symmetric
and positive semi-definite, mapping from $\mathcal{X} \times \mathcal{X}$ to $\mathbb{R}$, and
is assumed to satisfy Mercer's condition ((17)) so that $\|g\|_K$ is a norm.
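To make the penalized cost (2.1) concrete, the sketch below evaluates it for linear decision functions with the generalized hinge loss $h_{svm1}$ and the $L_1$ inverse geometric margin. It is an illustration under stated assumptions rather than the authors' code: the factor 1/2 in the penalty is read from the garbled source formula, and the names `penalized_cost` and `project_zero_sum` are hypothetical helpers.

```python
import numpy as np

def hinge_svm1(u):
    # generalized hinge loss: sum_j [1 - u_j]_+
    return np.sum(np.maximum(1.0 - u, 0.0))

def penalized_cost(W, b, X, y, lam):
    """Cost (2.1) for f_j(x) = <w_j, x> + b_j.

    W: (k, d) weights, b: (k,) intercepts, X: (n, d) inputs,
    y: (n,) 0-based labels, lam: tuning parameter lambda.
    Penalty J(f) = 1/2 * sum_j ||w_j||_1 (the 1/2 is an assumption).
    """
    F = X @ W.T + b                      # (n, k) matrix of f_j(x_i)
    n = X.shape[0]
    total = 0.0
    for i in range(n):
        u = F[i, y[i]] - np.delete(F[i], y[i])   # generalized margin
        total += hinge_svm1(u)
    penalty = 0.5 * np.sum(np.abs(W))
    return total / n + lam * penalty

def project_zero_sum(W, b):
    # Enforce sum_j f_j = 0 by centering weights and intercepts across classes.
    return W - W.mean(axis=0), b - b.mean()
```

A usage example: with `W = [[1, 0], [0, 1], [-1, -1]]`, zero intercepts, and two correctly labeled points far from the boundary, the hinge term vanishes and the cost reduces to `lam` times the penalty.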
3. Ideal generalization performance

The generalization error (GE) is often used to measure the generalization
accuracy of a classifier defined by $f$, which is

$$\mathrm{Err}(f) = P\big(Y \ne \arg\max_{j=1,\dots,k} f_j(X)\big) = EL(f, Z),$$

with multi-class misclassification (0-1) loss
$L(f, z) = I(Y \ne \arg\max_{j=1,\dots,k} f_j(X))$.
The corresponding empirical generalization error (EGE) is $n^{-1} \sum_{i=1}^n L(f, Z_i)$.

Often a surrogate loss $V$ is used in (2.1) as opposed to the 0-1 loss for
a computational consideration. In such a situation, (2.1) targets the
minimizer $f^V = \arg\inf_{f \in \mathcal{F}} EV(f, Z)$, which may not belong to
$\mathcal{F}$. Consequently, $EV(f^V, Z)$ represents the ideal performance under $V$,
whereas $EL(f^V, Z)$ is the ideal generalization performance of $f^V$ when $V$ is used
in (2.1). Now define, for $f \in \mathcal{F}$,

$$e(f, f^V) = EL(f, Z) - EL(f^V, Z),$$
$$e_V(f, f^V) = EV(f, Z) - EV(f^V, Z).$$

Note that for $f \in \mathcal{F}$, $e_V(f, f^V) \ge 0$ but $e(f, f^V)$ may not be so,
depending on the choice of $V$. In this article, we provide a bound of $|e(f, f^V)|$ to
measure the discrepancy between the actual performance and ideal performance of a
classifier defined by $f$ in generalization.

It is worthwhile to mention that for two margin losses $V_i$; $i = 1, 2$, the
ideal generalization performances determine the asymptotic behavior of the
actual generalization performances of the corresponding classifiers defined by
$\hat f_i$. Therefore, if $EL(f^{V_1}, Z) < EL(f^{V_2}, Z)$ then
$EL(\hat f_1, Z) < EL(\hat f_2, Z)$ eventually, provided that
$|e(\hat f_i, f^{V_i})| \to 0$ as $n \to \infty$. Consequently, a comparison
of $|e(\hat f_1, f^{V_1})|$ with $|e(\hat f_2, f^{V_2})|$ is useful only when their ideal
performances are the same, that is, $EL(f^{V_1}, Z) = EL(f^{V_2}, Z)$.

To study the ideal generalization performance of $f^V$ with respect to $V$,
let $\bar f$ be the (global) Bayes rule, obtained by minimizing $\mathrm{Err}(f)$ with
respect to all $f$, including $f \notin \mathcal{F}$. Note that the (global) Bayes rule
is not unique but its error is unique with respect to loss $L$, because any $f$
satisfying $\arg\max_j f_j(x) = \arg\max_j P_j(x)$ with $P_j(x) = P(Y = j | X = x)$ yields
the same minimal. Without loss of generality, we define
$\bar f = (\bar f_1, \dots, \bar f_k)$ with $\bar f_l(x) = \frac{k-1}{k}$ if
$l = \arg\max_j P_j(x)$, and $-\frac{1}{k}$ otherwise.

Let $V_{svmj}$ and $V_\psi$ be margin losses defined by $h_{svmj}$ and $h_\psi$, respectively.

Lemma 1. If $\mathcal{F}$ is a linear space, then

$$EL(f^V, Z) \ge EL(f^{V_\psi}, Z) = EL(f^L, Z) \ge EL(\bar f, Z),$$

for any margin loss $V$. If in addition, for generalized hinge losses $V_{svmj}$,
$j \in \{1, 3\}$, it is separable in that $EV(f^{V_{svmj}}, Z) = 0$, then

$$EL(f^V, Z) \ge EL(f^{V_{svmj}}, Z) = EL(f^{V_\psi}, Z) = EL(f^L, Z) \ge EL(\bar f, Z).$$

Lemma 1 concerns $V_\psi$ in both the separable and nonseparable cases, and
$V_{svmj}$; $j = 1, 3$, in the separable case, in relation to other margin losses. For
other margin losses, such an inequality may not hold generally, depending on
$\mathcal{F}$ and $V$. Therefore a case by case examination may be necessary; see
Section 5.1 for an example.

4. Actual generalization performance 

In our formulation, $\mathcal{F}$ is allowed to depend on the sample size $n$; so is
$f^V$ defined by $\mathcal{F}$. When $f^V$ depends on $n$ and approximates $f^*$
(independent of $n$) in that $|e_V(f^V, f^*)| \to 0$ as $n \to \infty$, it seems sensible
to use $|e(\hat f, f^*)|$ to measure the actual performance as opposed to
$|e(\hat f, f^V)|$. Without loss of generality, we assume that $V \ge 0$.

Let $f_0 = f^*$ when $f^* \in \mathcal{F}$; otherwise $f_0 \in \mathcal{F}$ is chosen such that $e_V(f_0, f^*) <$
with $\varepsilon_n$ defined in Assumption C. Now define truncated $V$-loss to be

$$V^T(f, Z) = V(f, Z) \wedge T,$$

for any $f \in \mathcal{F}$ and some truncation constant $T > 0$, where $\wedge$ defines the
minimum. Define

$$e_{V^T}(f, f^*) = E(V^T(f, Z) - V(f^*, Z)).$$

The following conditions are assumed based on the bracketing $L_2$ metric
entropy and the uniform entropy.

Assumption A: (Conversion) There exists a constant $T > 0$ independent
of $n$ such that $T \ge \max(V(f_0, Z), V(f^*, Z))$ a.s., and there exist constants
$0 < \alpha < \infty$ and $c_1 > 0$ such that for all $0 < \epsilon \le T$ and $f \in \mathcal{F}$,

$$\sup_{\{f \in \mathcal{F} : e_{V^T}(f, f^*) \le \epsilon\}} |e(f, f^*)| \le c_1 \epsilon^{\alpha}.$$

Assumption B: (Variance) For some constant $T > 0$, there exist constants
$\beta \ge 0$ and $c_2 > 0$ such that for all $0 < \epsilon \le T$ and $f \in \mathcal{F}$,

$$\sup_{\{f \in \mathcal{F} : e_{V^T}(f, f^*) \le \epsilon\}} \mathrm{Var}(V^T(f, Z) - V(f^*, Z)) \le c_2 \epsilon^{\beta}.$$

To specify Assumption C, we define the $L_2$-bracketing metric entropy and
the uniform metric entropy for a function space $\mathcal{G} = \{g\}$ consisting of
functions $g$. For any $\epsilon > 0$, call $\{(g_1^l, g_1^u), \dots, (g_m^l, g_m^u)\}$ an
$\epsilon$-bracketing set of $\mathcal{G}$ if for any $g \in \mathcal{G}$ there exists a $j$
such that $g_j^l \le g \le g_j^u$ and $\|g_j^u - g_j^l\|_2 \le \epsilon$, where
$\|g\|_2 = (Eg^2)^{1/2}$ is the usual $L_2$-norm. The metric entropy
$H_B(\epsilon, \mathcal{G})$ of $\mathcal{G}$ with bracketing is then defined as the logarithm
of the cardinality of the $\epsilon$-bracketing set of $\mathcal{G}$ of the smallest size.
Similarly, a set $(g_1, \dots, g_m)$ is called an $\epsilon$-net of $\mathcal{G}$ if for
any $g \in \mathcal{G}$ there exists a $j$ such that $\|g_j - g\|_{Q,2} \le \epsilon$, where
$\|\cdot\|_{Q,2}$ is the $L_2(Q)$-norm with respect to $Q$, defined as
$\|g\|_{Q,2} = (\int g^2 \, dQ)^{1/2}$. The $L_2(Q)$-metric entropy
$H_Q(\epsilon, \mathcal{G})$ is the logarithm of the covering number, that is, the minimal
size of all $\epsilon$-nets. The uniform metric entropy is defined as
$H_U(\epsilon, \mathcal{G}) = \sup_Q H_Q(\epsilon, \mathcal{G})$.
Let $J_0 = \max(J(f_0), 1)$, and
$\mathcal{F}_V(s) = \{Z \mapsto V^T(f, Z) - V(f_0, Z) : f \in \mathcal{F}, J(f) \le J_0 s\}$.

X. Shen and L. Wang/Multi-class margin classification

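Assumption C: (Complexity) For some constants $c_i > 0$; $i = 3, \dots, 5$, there
exists $\varepsilon_n > 0$ such that

$$\sup_{s \ge 1} \phi(\varepsilon_n, s) \le c_5 n^{1/2}, \qquad (4.1)$$

where $\phi(\varepsilon_n, s) = \int_{c_4 L}^{c_3^{1/2} L^{\beta/2}} H^{1/2}(u, \mathcal{F}_V(s)) \, du / L$
and $L = L(\varepsilon_n, \lambda, s) = \min(\varepsilon_n^2 + \lambda J_0 (s/2 - 1), 1)$,
where $H(\cdot, \cdot)$ is $H_B(\cdot, \cdot)$ or $H_U(\cdot, \cdot)$.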

Assumption A specifies a relationship between $e(f, f^*)$ and $e_{V^T}(f, f^*)$, which
is a first moment condition. Assumption B, on the other hand, relates $e_{V^T}(f, f^*)$
to the variance of $(V^T(f, Z) - V(f^*, Z))$. Evidently
$\mathrm{Var}(V^T(f, Z) - V(f^*, Z)) \le \min(\mathrm{Var}(V(f, Z) - V(f^*, Z)), T^2)$,
which implies that $\beta = 0$ in the worst case. Exponents $\alpha$ and $\beta$ in
Assumptions A and B are critical to determine the speed of convergence of
$e(\hat f, f^*)$, although $e_{V^T}(\hat f, f^*)$ may not converge fast. As illustrated in
Section 5.2, an arbitrarily fast rate is achievable in large margin linear
classification, because $\alpha$ can be arbitrarily large. Assumption B appears
to be important in discriminating several classifiers in the linear and non-linear
cases. Assumption C measures the complexity of $\mathcal{F}$. However, if $c_1$ and $c_2$
in Assumptions A and B depend on $n$, then they may enter into the rate.

Two situations are worth mentioning, depending on the richness of $\mathcal{F}$. First,
when $\mathcal{F}$ is rich, $f^* = \bar f$, and margin classification depends only on the
behavior of the marginal distribution of $X$ near the decision boundary. This is
characterized by the values of $\alpha$ and $\beta$. For instance, in nonlinear
multi-class $\psi$-learning, $\alpha = 1$ and $0 < \beta \le 1$, cf., (flil). This
corresponds to the case of the $n^{-1}$ rate in the separable case and $n^{-1/2}$ in the
non-separable case, as described in (0). Second, when $\mathcal{F}$ is not rich, as in
linear classification, $f^* \ne \bar f$ is typically the case, where $\alpha$ and $\beta$
depend heavily on the distribution of $(X, Y)$; see Section 5.2 for an example. As a
result, actual generalization performances of various margin classifiers are dominated
by different ideal generalization performances; see Section 5.1 for an example.

Theorem 1. If Assumptions A-C hold, then, for any estimated decision function
vector $\hat f$ defined in (2.1), there exists a constant $c_6 > 0$ depending on
$c_1$-$c_5$ such that

$$P\big(e(\hat f, f^*) \ge c_1 \delta_n^{2\alpha}\big) \le c_7 \exp\big(-c_6 n (\lambda J_0)^{2 - \min(\beta, 1)}\big),$$

provided that $\lambda^{-1} \ge 2 \delta_n^{-2} J_0$, where $c_7 = 3.5$ for the bracketing
entropy $H_B(\cdot, \cdot)$ and $c_7$ = (1-^(20(1- 3^^^ ^^^^j^^3_„i„,^ i) )-^)^/^) for the
uniform entropy $H_U(\cdot, \cdot)$, and
$\delta_n^2 = \min(\varepsilon_n^2 + 2 e_V(f_0, f^*), 1)$.

Corollary 1. Under the assumptions of Theorem 1, $|e(\hat f, f^*)| = O_p(\delta_n^{2\alpha})$,
$E|e(\hat f, f^*)| = O(\delta_n^{2\alpha})$, provided that
$n (\lambda J_0)^{2 - \min(\beta, 1)}$ is bounded away from zero.
The rate $\delta_n^2$ is governed by two factors: (1) $\varepsilon_n^2$, determined by the
complexity of $\mathcal{F}$, and (2) the approximation error $e_V(f_0, f^*)$ defined by $V$.
When $e_V(f_0, f^*) \ne 0$, there is usually a trade-off between the approximation error
and the complexity of $\mathcal{F}$ with respect to the choice of $f_0$; see Section 5.3.
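As a worked instance of Corollary 1, consider the exponents reported in Section 5.1 for the exponential, logistic and hinge losses: $\alpha = 1/2$, $\beta = 1$, and $f_0 = f^{V_j}$, so that the approximation term vanishes. The value $\varepsilon_n^2 \asymp n^{-1}$ used below is an assumption inferred from the $O_p(n^{-1/2})$ rate stated in that example, and $J_0$ is treated as a constant; under these assumptions the corollary reads:

```latex
\begin{align*}
\delta_n^2 &= \min\big(\varepsilon_n^2 + 2 e_V(f_0, f^*),\, 1\big)
            = \varepsilon_n^2 \asymp n^{-1}, \\
|e(\hat f, f^*)| &= O_p\big(\delta_n^{2\alpha}\big)
            = O_p\big((n^{-1/2})^{2 \cdot 1/2}\big)
            = O_p\big(n^{-1/2}\big), \\
n (\lambda J_0)^{2 - \min(\beta, 1)} &= n \lambda J_0 \asymp 1
            \quad \text{for } \lambda \sim n^{-1},
\end{align*}
```

so the side condition of the corollary holds with $\lambda \sim n^{-1}$, and a larger conversion exponent $\alpha$ would turn the same $\delta_n$ into a faster rate for the generalization error.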

Remark 1: The results in Theorem 1 and Corollary 1 continue to hold if
the "global" entropy is replaced by its corresponding "local" version; see e.g.,
m. That is, $\mathcal{F}_V(s)$ is replaced by
$\mathcal{F}_V^l(s) = \mathcal{F}_V(s) \cap \{V^T(f, Z) - V(f_0, Z) : e_V(f, f^*) \le 2s\}$.
The proof requires only a slight modification. The local entropy allows us to avoid
the loss of $\log(n)$ in linear classification, although it may not be useful for
nonlinear classification.

Remark 2: For $\psi$-learning, Theorem 1 may be strengthened by replacing T
by the corresponding set entropy if the problem structure is used; cf.,

Remark 3: The preceding formulation can be easily extended to the situation
of multiple regularizers by replacing $\lambda J(f)$ by its vector version, i.e.,
$\lambda^T J(f) = \sum_{j=1}^{l} \lambda_j J_j(f)$ with
$\lambda = (\lambda_1, \dots, \lambda_l)^T$ and $J(f) = (J_1(f), \dots, J_l(f))$.

5. Examples 

5.1. Linear classification: Ideal and actual performances 

This section illustrates that the ideal generalization performances of various
margin classifiers, defined by $f^{V_j}$, may differ, dominating the corresponding
actual ones, where $e(f^V, \bar f) \ne 0$. This reinforces our discussion in Section 3.

Consider, for simplicity, a two-class case with $X$ generated from probability
density $q(x) = 2^{-1}(\gamma + 1)|x|^{\gamma}$ for $x \in [-1, 1]$ and some $\gamma \ge 0$.
Given $X = x$, $Y$ is sampled from $\{0, 1\}$ according to $P(Y = 1 | X = x)$ that is
$\theta_1$ if $x > 0$ and $\theta_2$ otherwise, for constants $\theta_1 > 1/2$,
$\theta_2 < 1/2$, and $\theta_1 + \theta_2 \ne 1$. Here decision function vector $f$ is
$(f, -f)$ with $\mathcal{F} = \{f = ax + b\}$, $u(f(x), y)$ is equivalent to
$yf(x)$ for coding $y = \pm 1$ with $J(f) = |a|$.

Four margin losses are compared with respect to their ideal and actual
performances measured by $e(f^{V_j}, \bar f) \ge 0$ and $|e(\hat f, f^{V_j})|$. They are
exponential, logistic, hinge and $\psi$ losses, denoted as $V_1 = \exp(-yf(x))$,
$V_2 = \log(1 + \exp(-yf(x)))$, $V_3 = [1 - yf(x)]_+$, and
$V_4 = I[yf(x) < 0] + (1 - yf(x)) I[0 \le yf(x) < 1]$.
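The setup of this example can be simulated directly. The sketch below draws $(X, Y)$ from the stated distribution (with labels coded as $\pm 1$) and evaluates the empirical risks of the four losses for a linear rule $f(x) = ax + b$. It is an illustrative sketch, not the authors' code; the function names are hypothetical, and the inverse-CDF step relies on $|X|$ having distribution function $r^{\gamma+1}$ on $[0, 1]$.

```python
import numpy as np

def sample_example(n, gamma, theta1, theta2, rng):
    """Draw X with density (gamma+1)/2 * |x|^gamma on [-1, 1] and Y in {-1, +1}."""
    # |X| has CDF r^(gamma+1), so |X| = U^(1/(gamma+1)); the sign is symmetric.
    r = rng.uniform(size=n) ** (1.0 / (gamma + 1.0))
    x = r * rng.choice([-1.0, 1.0], size=n)
    p_pos = np.where(x > 0, theta1, theta2)      # P(Y = +1 | X = x)
    y = np.where(rng.uniform(size=n) < p_pos, 1.0, -1.0)
    return x, y

def losses(margin):
    """V_1..V_4 evaluated at the functional margin m = y f(x)."""
    m = np.asarray(margin, dtype=float)
    return {
        "exponential": np.exp(-m),
        "logistic": np.log1p(np.exp(-m)),
        "hinge": np.maximum(1.0 - m, 0.0),
        "psi": (m < 0).astype(float) + (1.0 - m) * ((m >= 0) & (m < 1)),
    }

def empirical_risks(a, b, x, y):
    """Sample averages of the four surrogate losses and of the 0-1 loss."""
    m = y * (a * x + b)
    out = {name: float(v.mean()) for name, v in losses(m).items()}
    out["zero_one"] = float((m < 0).mean())
    return out
```

For instance, `empirical_risks(1.0, 0.0, *sample_example(10**5, 0.0, 0.75, 0.25, np.random.default_rng(0)))` estimates the risks of the rule $f(x) = x$ in the symmetric case.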

To obtain an expression of $e_V(f, \bar f)$ and $e(f, \bar f)$, let
$R_{V_j}(a, b) = EV_j(f, Z)$ and $R(a, b) = EL(f, Z)$. Let
$(a_j^*, b_j^*) = \arg\inf R_{V_j}(a, b)$, $j = 1, \dots, 4$, and
$(\bar a, \bar b) = \arg\inf R(a, b)$. The expression of $R_{V_j}(a, b)$ is given in the
proof of Lemma 2 of the Appendix, with its properties stated in Lemma 2.

Lemma 2. The minimizer $(\bar a, \bar b) = (1, 0)$, $(a_j^*, b_j^*)$, $j = 1, \dots, 3$,
are finite, and $(a_4^*, b_4^*) = (\infty, 0)$, or equivalently, $R_{V_4}(a, b)$ attains
its minimal as $a \to +\infty$ and $b = 0$.

Based on Lemma 2, we compare the ideal performances
$e(f^{V_j}, \bar f) = R(a_j^*, b_j^*) - R(1, 0)$ for $j = 1, \dots, 4$. Since
$e(f^{V_j}, \bar f)$ is not analytically tractable, we provide a numerical comparison in
the case of $\theta_1 = 3/4$, $\gamma = 0$ and $\theta_2 \in [1/8, 3/8]$.
As displayed in Figure 1, $e(f^{V_j}, \bar f)$ decreases as $j$ increases from 1 to 4 with
$e(f^{V_4}, \bar f) = 0$, indicating that $V_4$ dominates $V_1$-$V_3$. Note that
$e(f^{V_j}, \bar f) = 0$; $j = 1, \dots, 4$, when $\theta_1 = 3/4$ and $\theta_2 = 1/4$,
because of symmetry of $q(x)$ in $x$.




Fig 1. Plot of $e(f^{V_j}, \bar f)$ as a function of $\theta_2$ (horizontal axis, 0.15 to 0.35),
when $\theta_1 = 3/4$ and $\gamma = 0$. Solid circle, circle, square and solid square
(from the top curve to the bottom curve) represent $V_j$; $j = 1, \dots, 4$.

We now verify Assumptions A-C for $V_j$; $j = 1, \dots, 4$, with Assumptions A
and B checked in Lemma 3.

Lemma 3. Assumptions A and B are met for $V_j$, $j = 1, \dots, 3$, with $\alpha = 1/2$
and $\beta = 1$. For $V_4$, Assumptions A and B hold with $\alpha = 1$ and $\beta = 1$.

To verify Assumption C, let $f_0 = f^{V_j}$ for $V_j$; $j = 1, \dots, 3$, and compute the
local entropy of
$\mathcal{F}_{V_j}^l(s) = \mathcal{F}_{V_j}(s) \cap \{V_j(f, \cdot) - V_j(f^{V_j}, \cdot) : e_{V_j}(f, f^{V_j}) \le 2s\}$.
Note that $e_{V_j}(f, f^{V_j}) \le 2s$ implies that
$\|(a, b) - (a_j^*, b_j^*)\| \le c' s^{1/2}$ for the Euclidean norm $\|\cdot\|$ and some
$c' > 0$. In addition, for any $g_1, g_2 \in \mathcal{F}_{V_j}^l(s)$,
$|g_1(z) - g_2(z)| \le |(a_1 - a_2)x + (b_1 - b_2)| \le 2\max(|a_1 - a_2|, |b_1 - b_2|)$.
Direct calculation yields that
$H_B(u, \mathcal{F}_{V_j}^l(s)) \le c(\log(\min(s^{1/2}, c' u^{1/2})/u^{1/2}))$ for some
constant $c > 0$. Easily, $\sup_{s \ge 1} \phi(\varepsilon_n, s) \le c_1/\varepsilon_n$,
when $\lambda \sim \varepsilon_n^2 \le 1$. Solving (4.1) yields $\varepsilon_n = n^{-1/2}$.
By Corollary 1, $e(\hat f, f^{V_j}) = O_p(\varepsilon_n^{2\alpha}) = O_p(n^{-1/2})$, and
$Ee(\hat f, f^{V_j}) = O(n^{-1/2})$ when $\lambda \sim n^{-1}$.

For $V_4$, let $f_0 = (nx, -nx)$. Similarly, $e(\hat f, \bar f) = O_p(n^{-1})$ and
$Ee(\hat f, \bar f) = O(n^{-1})$ when $\lambda \sim n^{-1}$ for $\psi$-learning.

In conclusion, the ideal performances for $V_1$-$V_4$ are usually not equal, with
$V_4$ the best as suggested by Lemma 1. The actual performances are dominated
by their ideal performances when $\theta_1 + \theta_2 \ne 1$, although
$e(\hat f, f^{V_j}) = O_p(n^{-1/2})$; $j = 1, \dots, 3$, and
$Ee(\hat f, f^{V_4}) = O(n^{-1})$.

5.2. Multi-class linear classification: Arbitrarily fast rates 

This section illustrates that the rates of convergence for the hinge and logistic
losses can be arbitrarily fast even in linear classification. This is because
the conversion exponent $\alpha$ in Assumption A can be arbitrarily large, although

Consider four-class linear classification in which $X$ is sampled according to
probability density $q(x_1, x_2) = A \min(|x_1|, |x_2|)^{\gamma}$ for
$(x_1, x_2) \in [-1, 1]^2$, with $\gamma \ge 0$ and normalizing constant $A > 0$. Let $S_j$;
$j = 1, \dots, 4$, be four regions $\{x_1 \ge 0, x_2 \ge 0\}$, $\{x_1 \ge 0, x_2 < 0\}$,
$\{x_1 < 0, x_2 \ge 0\}$, and $\{x_1 < 0, x_2 < 0\}$. Now $Y$ is assigned to class $c = j$
with probability $\theta$ ($1/4 < \theta < 1$) and to the remaining three classes with
probability $(1 - \theta)/3$ for each, when $x \in S_j$; $j = 1, \dots, 4$. The
(global) Bayes rule is $\bar f = (x_1 + x_2, -x_1 + x_2, -x_1 - x_2, x_1 - x_2)^T$. In this
case, decision function vector $f(x)$ is parameterized as
$(w_1^T x, w_2^T x, w_3^T x, -(w_1 + w_2 + w_3)^T x)^T$, defined by
$w = (w_{11}, w_{12}, w_{21}, w_{22}, w_{31}, w_{32})$. Here $\mathcal{F}$ consists
of such $f$'s and $J(f) = \sum_{c} \sum_{j} w_{cj}^2$.

For the hinge loss $V(f, z) = h(u(f(x), y))$ with

$$h(u) \equiv h_{svm2}(u) = \sum_{j=1}^{k-1} \Big[\frac{\sum_{c=1}^{k-1} u_c}{k} - u_j + 1\Big]_+,$$

write $EV(f, Z)$ and $EL(f, Z)$ as $R_V(w)$ and $R(w)$. Then $R_V(w)$ is piecewise
differentiable, convex, and is minimized by $w^* = r(1, 1, -1, 1, -1, -1)$, where $r$ is
the largest negative root of a polynomial in $x$: Q{6 — 1) — lQ{6—l)x+\2{9—l)x'^+
(640 — A)x^ ~ 0. By symmetry of $q(\cdot, \cdot)$, $R_V(w)$ is twice-differentiable at $w^*$
with positive definite Hessian matrix $H_1$, implying that
$f^V = (f_1^V, \dots, f_4^V) = ((w_1^*)^T x, (w_2^*)^T x, (w_3^*)^T x, -((w_1^*)^T + (w_2^*)^T + (w_3^*)^T)x)^T$,
and $e(f^V, \bar f) = 0$.
We verify Assumptions A-C with Assumptions A-B checked in Lemma 4.

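Lemma 4. In this example, Assumptions A and B are met for $V$ with
$\alpha = (\gamma + 1)/2$ and $\beta = 1$.

For Assumption C, we compute the local entropy of
$\mathcal{F}_V^l(s) = \mathcal{F}_V(s) \cap \{V(f, \cdot) - V(f^V, \cdot) : e_V(f, f^V) \le 2s\}$.
Note that $e_V(f, f^V) \le 2s$ implies that $\|w - w^*\| \le c' s^{1/2}$ for some
$c' > 0$, and for any $g, g' \in \mathcal{F}_V^l(s)$,
$|g(z) - g'(z)| \le \sum_{c=1}^{4} |f_c(x) - f_c'(x)| \le 12 \max_{1 \le c \le 4, 1 \le j \le 3}(|w_{cj} - w_{cj}'|)$.
Direct calculation yields that
$H_B(u, \mathcal{F}_V^l(s)) \le O(\log(\min(s^{1/2}, c' s^{1/2})/u^{1/2}))$, and that
$\sup_{s \ge 1} \phi(\varepsilon_n, s) \le c_1/\varepsilon_n$, when
$\lambda \sim \varepsilon_n^2 \le 1$. Solving (4.1) yields $\varepsilon_n = n^{-1/2}$. By
Corollary 1, $e(\hat f, \bar f) = e(\hat f, f^V) = |e(\hat f, f^V)| = O_p(n^{-(\gamma+1)/2})$
and $E|e(\hat f, f^V)| = O(n^{-(\gamma+1)/2})$ when $\lambda \sim n^{-1}$.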

For the logistic loss, an application of the same argument yields the
same $w^* = r(1, 1, -1, 1, -1, -1)$ as the minimizer of $R_V(w)$. Furthermore,
$\alpha = (\gamma + 1)/2$ and $\beta = 1$, yielding the same rates as the hinge loss.
Interestingly, the fast rate $e(\hat f, \bar f) = e(\hat f, f^V) = O_p(n^{-(\gamma+1)/2})$
arises because classification is easier than its counterpart, function estimation, as
measured by $\gamma \ge 0$. This is evident from the fact that
$e_V(\hat f, f^V) = O_p(n^{-1/2})$. This rate is arbitrarily fast as $\gamma \to \infty$.

5.3. Nonlinear classification: Spline kernels 

This section examines one nonlinear learning case and the issue of dominating
class with regard to generalization error for multi-class SVM with
$V_{svm1} = h_{svm1} = \sum_{j=1}^{k-1} [1 - u_j]_+$ and multi-class $\psi$-learning with
$V_\psi = h_\psi = \psi(\min_{1 \le j \le k-1} u_j)$, as defined in Section 2. Consider
three-class classification with spline kernels, where $X$ is generated according to the
uniform distribution over $[0, 1]$. Given $X = x$, $Y = (Y_1, \dots, Y_3)$ is sampled from
$(P(Y = 1 | X = x), P(Y = 2 | X = x), P(Y = 3 | X = x)) = (p_1(x), p_2(x), p_3(x))$, which is
$(5/11, 3/11, 3/11)$ when $x < 1/3$, $(3/11, 5/11, 3/11)$ when $1/3 \le x < 2/3$, and
$(3/11, 3/11, 5/11)$ when $x \ge 2/3$. Evidently, for each $x \in [0, 1]$, there does not
exist a dominating class because $\max_{1 \le l \le 3} p_l(x) = 5/11 < 1/2$.
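The conditional class probabilities of this example, and the claim that no class dominates, can be checked mechanically. The following is a small illustrative sketch with hypothetical function names; it uses exact fractions so that the comparison with 1/2 involves no rounding.

```python
from fractions import Fraction

def class_probabilities(x):
    """(p_1(x), p_2(x), p_3(x)) for x in [0, 1], as specified in this example."""
    hi, lo = Fraction(5, 11), Fraction(3, 11)
    if x < Fraction(1, 3):
        return (hi, lo, lo)
    if x < Fraction(2, 3):
        return (lo, hi, lo)
    return (lo, lo, hi)

def bayes_class(x):
    """Class label in {1, 2, 3} with the largest conditional probability."""
    p = class_probabilities(x)
    return 1 + max(range(3), key=lambda j: p[j])

def has_dominating_class(x):
    """A class dominates at x if its conditional probability exceeds 1/2."""
    return max(class_probabilities(x)) > Fraction(1, 2)
```

Evaluating `has_dominating_class` on a grid over $[0, 1]$ returns `False` everywhere, while `bayes_class` still identifies a unique most likely class in each third of the interval.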

In this example,
$\mathcal{F} = \{(f_1, f_2, f_3) : f_c \in W_m[0, 1], \sum_{c=1}^{3} f_c = 0\}$ is defined by
a Sobolev space $W_m[0, 1] = \{f : f^{(m-1)}$ is absolutely continuous,
$f^{(m)} \in L_2[0, 1]\}$ with the degree of smoothness $m$ measured by the $L_2$-norm,
generated by spline kernel $K(\cdot, \cdot)$ whose expression can be found in (10) or (23).
In what follows, we embed
$\{(f_1, f_2, f_3) : f_c = \sum_{i=1}^{n} \alpha_{ic} K(x_i, x)\}$ with penalty
$J(f) = \sum_{c=1}^{3} \int_0^1 f_c^{(m)}(u)^2 \, du$ in $\mathcal{F}$ into $\mathcal{F}$ with
penalty
$J(f) = \sum_{c=1}^{3} \big(\sum_{j=0}^{m-1} f_c^{(j)}(0)^2 + \int_0^1 f_c^{(m)}(u)^2 \, du\big)$.
It follows from the reproducing kernel Hilbert spaces (RKHS) representation theorem
(cf., (10)) that minimization of (2.1) over $\mathcal{F}$ is equivalent to that over its
subspace $\{(f_1, f_2, f_3) : f_c = \sum_{i=1}^{n} \alpha_{ic} K(x_i, x)\}$ with
$J(f) = \sum_{c=1}^{3} \int_0^1 f_c^{(m)}(u)^2 \, du$.

We now verify Assumptions A-C. Some useful facts are given in Lemmas 5-7.

Lemma 5. (Global Bayes rule $\bar f$) In this example,
$\bar f(x) = (I[0 \le x < 1/3] - 1/3, I[1/3 \le x < 2/3] - 1/3, I[2/3 \le x \le 1] - 1/3) = \arg\inf_f EV(f, Z)$
for $V = V_{svm1}, V_\psi$.

Lemma 6. (Assumption A) In this case, $e_{V^T}(f, f^*) \ge C e(f, f^*)$ for
$V = V_{svm1}$, some constant $C > 0$, any $T > 9$ and any measurable $f \in \mathbb{R}^3$.

Lemma 7. (Assumption B) In this example,
$E(V^T(f, Z) - V(f^*, Z))^2 \le C E(V^T(f, Z) - V(f^*, Z))$ for some constant $C > 0$,
any $T > 9$ and any measurable $f \in \mathbb{R}^3$, where $V = V_{svm1}$.

By Lemma 5,
$\bar f(x) = (I[0 \le x < 1/3] - 1/3, I[1/3 \le x < 2/3] - 1/3, I[2/3 \le x \le 1] - 1/3)$,
which can be approximated by $\mathcal{F}$ with respect to the $L_1$-norm $\|\cdot\|_1$
($\|f\|_1 = E|f(X)|$). Hence $f^* = \bar f = f^V = \arg\inf_f EV(f, Z)$ for
$V = V_{svm1}, V_\psi$.

SVM: Let $f_0 = (f^{(1)}, -f^{(1)} - f^{(3)}, f^{(3)})$ with
$f^{(1)} = 2/3 - (1 + \exp(-\tau(x - 1/3)))^{-1}$,
$f^{(3)} = (1 + \exp(-\tau(x - 1/3)))^{-1} - 1/3$, with parameter $\tau > 0$ to be
specified. Set $T = 4$. Then
$T \ge \max(\sup_z V_{svm1}(f_0, z), \sup_z V_{svm1}(f^*, z)) > 0$. By
Lemmas 6 and 7, Assumptions A and B are met with $\alpha = 1$ and $\beta = 1$. For
Assumption C, it can be verified that
$e_{V_{svm1}}(f_0, f^*) \le 2 \int_0^1 \|f_0(u) - f^*(u)\|_1 \, du = O(\tau^{-1})$, and
$J(f_0) = O(\tau^{2m-1})$. By Proposition 6 of Q),
$H_B(u, \mathcal{F}_{V_{svm1}}(s)) = O(((J_0 s)^{1/2}/u)^{1/m})$ with
$J_0 = O(\tau^{2m-1})$. Solving (4.1) yields a rate
$\varepsilon_n^2 = O((J_0^{1/(4m)} n^{-1/2})^{4m/(2m+1)})$ when
$J_0 \lambda \sim \varepsilon_n^2$. As a result, we have
$e_{V_{svm1}}(\hat f, f^*) = O_p(\max(\varepsilon_n^2, e_{V_{svm1}}(f_0, f^*))) = O_p(\max(\tau^{(2m-1)/(2m+1)} n^{-2m/(2m+1)}, \tau^{-1})) = O_p(n^{-1/2})$,
with a choice of $\tau \sim n^{1/2}$ and $\lambda \sim n^{-m}$.

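$\psi$-learning: Set $T \ge 1$ as $0 \le V_\psi \le 1$. For Assumption A, $\alpha = 1$ by
Theorem 3.1 of (16). For Assumption B, $\beta = 1$ following an argument similar to
that in (flil). For Assumption C, let $f_0(x) = \tau(4 - 9x, 1, 9x - 5)$ when $m \ge 2$;
$f_0 = (f^{(1)}, -f^{(1)} - f^{(3)}, f^{(3)})$ with
$f^{(1)} = 2/3 - (1 + \exp(-\tau(x - 1/3)))^{-1}$,
$f^{(3)} = (1 + \exp(-\tau(x - 1/3)))^{-1} - 1/3$, when $m = 1$. Then
$e_{V_\psi}(f_0, f^*) = O(\tau^{-1})$, $J(f_0) = O(\tau^{q})$ with $q = 1$ when $m = 1$ and
$q = 2$ when $m \ge 2$, and
$H_B(u, \mathcal{F}_{V_\psi}(s)) = O(((J_0 s)^{1/2}/u)^{1/m})$. Solving (4.1) yields a rate
$\varepsilon_n^2 = O((J_0^{1/(4m)} n^{-1/2})^{4m/(2m+1)})$ when
$J_0 \lambda \sim \varepsilon_n^2$. By Corollary 1, when $m \ge 2$,
$e(\hat f, \bar f) = O_p(\max(\varepsilon_n^2, e_{V_\psi}(f_0, \bar f))) = O_p(\max(\tau^{2/(2m+1)} n^{-2m/(2m+1)}, \tau^{-1})) = O_p(n^{-2m/(2m+3)})$;
when $m = 1$, $e(\hat f, \bar f) = O_p(n^{-1/2})$ with $\tau \sim n^{1/2}$ and
$\lambda \sim n^{-1}$. This yields $Ee(\hat f, \bar f) = O(n^{-2m/(2m+3)})$ when $m \ge 2$,
with $\tau \sim n^{2m/(2m+3)}$ and $\lambda \sim n^{-6m/(2m+3)}$;
$Ee(\hat f, \bar f) = O(n^{-1/2})$ when $m = 1$ with $\tau \sim n^{1/2}$ and
$\lambda \sim n^{-1}$.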

Evidently, the approximation error $e_V(f_0, \bar f)$ and $J_0 = \max(J(f_0), 1)$ play a
key role in rates of convergence. With different choices of approximating $f_0$ for
the $\psi$-loss and the hinge loss, $\psi$-learning and SVM have different error rates,
with the $\psi$-loss yielding a faster rate when $m \ge 2$ and the same rate when $m = 1$.
Moreover, in this example, the dominating class does not seem to be an issue.



5.4. Feature selection: High-dimension p but low sample size n



This section illustrates applicability of the general theorem to the high-dimension,
low sample size situation. Consider feature selection in classification, where the
number of candidate covariates $p$ is allowed to greatly exceed the sample size $n$
and to depend on $n$. For the $L_1$ penalty, (22) and (28) obtained the rates of
convergence for the binary SVM when $p < n$ and multi-class SVM when $p > n$.

Here we apply the general theory to the elastic-net penalty (see (31)) for
binary SVM (27) to obtain a parallel result of (28). We use linear representations
in (2.1) as in (27), because of over-specification of non-linear representations.
Here decision function vector $f$ is $(f, -f)$ with
$f \in \mathcal{F} = \{f(x) = w^T x : x \in [-1, 1]^p\}$, and
$J(f) = J_e(f) = \theta \|w\|_1 + (1 - \theta) \|w\|_2^2$ is a weighted average of
the $L_1$ and $L_2$ norms with a weighting parameter $\theta \in [0, 1]$, cf., (31).

In this example, $(X = (X_1, \dots, X_p), Y)$ are generated as follows. First,
randomly sample $X$ according to the uniform distribution on $[-1, 1]^p$. Second, given
$X = x$, $Y$ is sampled according to $P(Y = 1 | X_1 = x_1)$, which is $\tau > 1/2$ if
$x_1 > 0$, and $1 - \tau$ if $x_1 \le 0$. This is a version of Example 5.1 with
$\gamma = 0$ and $\theta_1 = \tau$ and $\theta_2 = 1 - \tau$ in a high-dimensional
situation. Evidently, $(X_2, \dots, X_p)$ are redundant variables.
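The penalty and the data-generating mechanism of this example are simple enough to sketch in code. The block below is an illustration with hypothetical function names, not the authors' implementation; in particular, the squared $L_2$ term in the penalty is an assumption read from the garbled source formula, following the usual elastic-net form.

```python
import numpy as np

def elastic_net_penalty(w, theta):
    """J_e(f) = theta * ||w||_1 + (1 - theta) * ||w||_2^2 (squared L2 assumed)."""
    w = np.asarray(w, dtype=float)
    return theta * np.sum(np.abs(w)) + (1.0 - theta) * np.sum(w ** 2)

def sample_sparse_example(n, p, tau, rng):
    """X uniform on [-1, 1]^p; Y in {-1, +1} depends on the first coordinate only."""
    x = rng.uniform(-1.0, 1.0, size=(n, p))
    p_pos = np.where(x[:, 0] > 0, tau, 1.0 - tau)   # P(Y = +1 | X_1 = x_1)
    y = np.where(rng.uniform(size=n) < p_pos, 1.0, -1.0)
    return x, y
```

With `theta = 1` the penalty reduces to the $L_1$ norm and with `theta = 0` to the squared $L_2$ norm; in the sampled data, only the first column of `x` carries information about `y`, so the remaining `p - 1` columns play the role of redundant variables.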
We now verify Assumptions A-C for the hinge loss $V$. Because $X_1$ and $Y$
are independent of $(X_2, \dots, X_p)$, one can verify that the minimal of $EV(f, Z)$
is that of $EV(f_1(X_1), Y)$ over $\{f_1 : f_1(x) = a x_1 + b\}$, attained by
$f_1^* = a^* x_1$ for some $a^* > 0$. For Assumptions A-B, we apply the result in
Example 5.1 to obtain $\alpha = 1/2$ and $\beta = 1$.

For Assumption C, we apply Lemma 8 to compute $H_U(\epsilon, \mathcal{F}_V(s))$.

Lemma 8. For Gp{s) = {f{x) = w^x : w,x ^ [— 1, 1]^, ||u;|| 1 < s} and any 
£ > 0, there exists a constant c > such that Hu{e,Gp{s)) < cs^ {p \og{l + -^) + 



e-2log(pe2 + 1)). 

Note that T{s) C Gpie-^s), by Lemma[51 J^v(s)) = 0(plog(l + ^) + 

£-2log(p£2 + l)). Set A - el/{2 J{f*)). To solve sup^ (/)(e„, s) < 02^1/2, note that 
supg (/)(e„, s) = s*) for some finite s*. Then it suffices to solve (/>(en, s*) < 
C2n^/^, involving 



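Three cases are examined. First, when $p\varepsilon_n^2 = o(1)$, $I_1 \ge I_2$ and $I_1 + I_2 \le 2I_1 = O\big((p\varepsilon_n^2 \log((p\varepsilon_n^2)^{-1}))^{1/2}\big)$. Solving $\phi(\varepsilon_n, s^*) \le c_2 n^{1/2}$ is equivalent to solving $(p\varepsilon_n^2)^{1/2}\log^{1/2}((p\varepsilon_n^2)^{-1}) = O(n^{1/2}\varepsilon_n^2)$ with respect to $\varepsilon_n^2$, which yields $\varepsilon_n^2 = O((p/n)\log(n/p))$. Second, when $p\varepsilon_n^2 = O(1)$, there exist two constants $0 < B_1, B_2 < \infty$ such that $B_1 \le I_1 + I_2 \le B_2$, implying $\varepsilon_n^2 \sim O(n^{-1/2})$. Third, when $p\varepsilon_n^2 \to \infty$, $I_2 \ge I_1$ and $I_1 + I_2 \le 2I_2 = O\big(\log^{1/2}(p\varepsilon_n^2)\log(\varepsilon_n^{-1})\big)$. Solving the equation $\log^{1/2}(p\varepsilon_n^2)\log(\varepsilon_n^{-1}) = O(n^{1/2}\varepsilon_n^2)$ yields $\varepsilon_n^2 = (n^{-1}\log(t_n p))^{1/2}\log(n)$ when $t_n = (n^{-1}\log p)^{1/2}\log(n/\log p)$.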

As a result, the rate is $\varepsilon_n = ((p/n)\log(n/p))^{1/2}$ when $p \ll n^{1/2}$, $\varepsilon_n = n^{-1/4}$ when $p = O(n^{1/2})$, and $\varepsilon_n = [(n^{-1}\log(t_n p))^{1/2}\log(n)]^{1/2}$ with a choice of $t_n = (n^{-1}\log p)^{1/2}\log(n/\log p)$ when $p \gg n^{1/2}$ but $\log p/n = o(1)$. Note that in the last case Assumption B plays no role when $p$ is too large. By Corollary 1, $e(\hat f, \bar f) = O_p(\varepsilon_n)$ when $\lambda \sim \varepsilon_n^2$.

6. Conclusion 

This article develops a statistical learning theory for quantifying the generaliza- 
tion error of large margin classifiers in multi-class classification. In particular, 
the theory develops upper bounds for a general large margin classifier, which 
permits a theoretical treatment for the situation of high-dimension but low sam- 
ple size. Through the theory, several learning examples are studied, where the 
generalization errors for several large margin classifiers are established. In a linear case, fast rates of convergence are obtained, and in a case of sparse learning, rates are derived for feature selection in which the number of variables greatly exceeds the sample size.

To compare various large margin classifiers with regard to generalization, we 
may need to develop a lower bound theory. Otherwise, a comparison may be 
inconclusive although our learning theory provides an upper bound result. 

Acknowledgments. The author would like to thank the reviewers for helpful
comments and suggestions. This research was supported in part by National 
Science Foundation Grants IIS-0328802 and DMS-0604394. 



Appendix A: Technical proofs 

Proof of Lemma [TJ To prove EL{f^^',Z) = EL{f^, Z), note that it follows 
from the definition of that EL{f^,Z) > EL{f^,Z). Then for any e > 
there exists fo & such that EL{fo,Z) < EL{f^,Z) + e. It follows from 
linearity of ^ that cfo G J- for any constant c > 0. The result then follows from 
the fact that limc^oo EV^{cfo, Z) = EL{fo, Z). 

It follows from the fact that EL{f^,Z) > EL{f^,Z) that EL{f'\Z) > 
ELif''^,Z) = EL{f^,Z). 

For V = hgyjnj, in the separable case, the result follows from that hsvmiiz) > 
hsvmsiz) = ^V^,{z) for z > 0. 

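Proof of Theorem 1: First we introduce some notations to be used. Let $\tilde V(f, Z) = V(f, Z) + \lambda J(f)$ and $\tilde V^T(f, Z) = V^T(f, Z) + \lambda J(f)$. Define the scaled empirical process $E_n(\tilde V^T(f, Z) - \tilde V(f_0, Z))$ as $n^{-1}\sum_{i=1}^n (\tilde V^T(f, Z_i) - \tilde V(f_0, Z_i))$. Let $A_{i,j} = \{f \in \mathcal{F} : 2^{i-1}\delta_n^2 \le e_{V^T}(f, f^*) < 2^i\delta_n^2,\ 2^{j-1}\max(J_0, 1) \le J(f) < 2^j\max(J_0, 1)\}$ and $A_{i,0} = \{f \in \mathcal{F} : 2^{i-1}\delta_n^2 \le e_{V^T}(f, f^*) < 2^i\delta_n^2,\ J(f) < \max(J_0, 1)\}$, for $j = 1, 2, \ldots$, and $i = 1, 2, \ldots$.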

The treatment here is to use a large deviation inequality in Theorem 3 of 
for the bracketing entropy and Lemma 9 below for the uniform entropy. Our approach for bounding $P(|e(\hat f, f^*)| \ge c_1\delta_n^{2\alpha})$ is to bound a sequence of empirical processes induced by the cost function $\tilde V$ over $A_{ij}$; $i, j = 1, \ldots$. Specifically,
we apply a large deviation inequality for empirical processes, by controlling the 
mean and variance defined by V{f, Zi) and penalty A. This yields an inequality 
for empirical processes and thus for e{f,f*). In what follows, we shall prove the 
case of the bracketing entropy as that for the uniform entropy is essentially the 
same. 

First we establish a connection between e(/,/*) and the cost function. By 
the definition of /, fo [ey'^ifo^ f*) ^ ev(/o,/*) < 5^), and Assumption YK[ 
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/*)| > ci(52"} c {eyT(/, /*) > 5l] is a subset of 

sup Y.^V{f^,Zi)-V{f,Zi))>{)\ 

C I sup y2^V{fo,Z,)-V^{f,Z,))>o\. 

[{fey^-e,,T{fJ')>sU'~i J 

Hence P ^|e(/, /*)| > ci(5^"^ is upper bounded by 

I = P*l sup n-^y2^V{fo,Z,)-V^{f,Z,))>o], 

< h+h, 

where P* is the outer probabiUty. and 

h - ™P J2iV{fo, Zi) - Z,)) > ) 

/2 = Vp* sup n-iV(y(/o,Z,)-l^^(/,Z,))>0 . 

z=i V/e-^.o y 

To bound h, consider P* (sup_^g^, ^. n'^ EtiiVifo. Z,) - V^if, Z,)) > o) , 

for each i = I,-- - ,j = 0, • • • . Let M{i,j) = 2^-^51 + X2i-\j{fa). For the 
mean, using the assumption that > XJq and the fact that eyr(/o,/*) = 
evifoJ*) < 51/2, it foUows that inf^, , E(V^^(/, Zi) - V[fo,Zi)) is lower 
bounded by 

i^nf E(l/^(/, Zi) - V{f\Z^) + \{J{f) - J(/o))) - eyT(/o, /* ) > M(z, j), 

i — 1, • ■ ■ , j = 0, • • • . Similarly, for the variance, it follows from Assumption IB] 
and the fact that Var{V'^{f, Zi)-V{fo, Zi)) < 2[Var{V^{f, Zi)-V{f* ,Zi))+ 
Var{V{fQ,Z^)-V{r,Z^))] that 

sup Far - V{h,Zi)) < {i, j); 

i = 1, • • • , j = 0, • • • , . 

Note that < (5„ < 1 and Amax(Jo, 1) < (5^j/2. An application of Theorem 
3 of 113) with M = n^/^M{i,j), v = Ac2Ml^{i,j), e = 1/2, and \V{fo,Z,) - 
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Zi)\ < 2T, yields, by Assumption O that 

(l-£)nM(i,i)2 



h < 3exp 

ij:A/(l,i)<T 



2(4M'3(^,J)+M(^,j)^/3) 



< ^^3exp(-C6nM(i,j)2-™"(i 

CX3 OO 

< ^^3exp(-C6n[2'-i,52 + (2^-i - 1)A Jo]'"™"^'"''') 



OO OO 



- im3exp(-C6n[(2'-i52)2-min(i,;3) ^ ((2^^! - 1)A Jo)'-™"^^'^)]) 

< 3exp(~C6n(AJo)2-™(i'^))/[(l - cxp{-cen{XJ„f-^'-^''^^))]\ 

Here and in the sequel $c_6$ is a positive generic constant. Similarly, $I_2$ can be bounded.

To prove the result with the uniform entropy, we use Lemma 9 with a slight modification of the proof.
Finally, 

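$$I \le I_1 + I_2 \le 6\exp\big(-c_6 n(\lambda J_0)^{2-\min(1,\beta)}\big) \big/ \big[1 - \exp\big(-c_6 n(\lambda J_0)^{2-\min(1,\beta)}\big)\big]^2.$$

This implies that $I^{1/2} \le (5/2 + I^{1/2})\exp(-c_6 n(\lambda J_0)^{2-\min(1,\beta)})$. The result then follows from the fact $I \le I^{1/2} \le 1$. □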

Now we derive Lemma |9] as a version of Theorem 1 of using the uniform 
entropy. 

Lemma 9. Let T he a collection of functions f with < f < 1, Pn{f) = 
n-^E7=ifiYi), Pf = Ef{Yi) = 0, ~ i.i.d, and let v > sup^g^P/^ = 
supf^yrVar{f). For M > and real e (0,1), let 'ip{M,n,v) = and 
s = Suppose 

Hu{v^'\T)<-^i:{M,n,v), (A.l) 
M < 16(1 - 361/4)1/2^, (A.2) 

and, if s < v^l'^ , 

1(5/4, / Hu{u,TY'^du< 

Js/i 256 



(A.3) 



Th 



en 



P*(sup|P„/i-P/i| >4Af) < 10(1-— —i -) 'exp(-(l-0)7^(M,n,i))). 



Proof: The proof uses conditioning and chaining. The first step is conditioning. Let $Z_1, \ldots, Z_N$ be an i.i.d. sample from $P$, and let $(R_1, \ldots, R_N)$ be uniformly distributed over the set of permutations of $(1, \ldots, N)$, where $N = mn$, with $m = 2$. Define $n' = N - n$, $P_{n,N} = n^{-1}\sum_{i=1}^n \delta_{Z_{R_i}}$, and $P_N = N^{-1}\sum_{i=1}^N \delta_{Z_i}$, with $\delta_{Z_i}$ the Dirac measure at observation $Z_i$. Then the following inequality can be thought of as an alternative to the classical symmetrization inequality (cf. [25], Lemma 2.14.18 with $a = 2^{-1}$ and $m = 2$),

P*(sup \P,,h - Ph\ > 4A/) < (l - V*(sup \Pr,.Nh - PNh\ > M). 

(A.4) 

Conditioning on $Z_1, \ldots, Z_N$, it suffices to consider $P^*_{|N}(\sup_{\mathcal{F}} |P_{n,N}h - P_N h| > M)$, where $P_{|N}$ is the conditional distribution given $Z_1, \ldots, Z_N$.

The second step is to bound $P^*_{|N}(\sup_{\mathcal{F}} |P_{n,N}h - P_N h| > M)$ by chaining. Let $\varepsilon_0 > \varepsilon_1 > \ldots > \varepsilon_T > 0$ be a sequence to be specified. Denote by $\mathcal{F}_q$ the minimal $\varepsilon_q$-net for $\mathcal{F}$ with respect to the $L_2(P_N)$-norm. For each $h$, let $\pi_q h = \arg\min_{g \in \mathcal{F}_q} \|g - h\|_{P_N,2}$. Evidently, $\|\pi_q h - h\|_{P_N,2} \le \varepsilon_q$, and $|\mathcal{F}_q| = N(\varepsilon_q, \mathcal{F}, L_2(P_N))$, the covering number. Then $P^*_{|N}(\sup_{\mathcal{F}} |P_{n,N}h - P_N h| > M)$ is bounded by



P|V(sup \{Pn,N - PN){7Toh)\ > (1 - -)M) 

+P|V(sup |(P„.w - PAr)(7ro/i - ■nTh)\ > ^) 

H o 
+P|*jv(sup |(P„.w - PN){^Th - > ^) 



< |.fo|supP|V(|(P„,iv -PAr)(7ro/i)| > (1 - -)M) 

T 



^ \Tq\\Tq^-Y\ SUP-P|'^(|(^n,Ar " P/v) (tT, - TT^- 1 /l) | > ^q) 
9=1 

SUpP|V(|(P„,JV - PN){l^Th~h)\ > ^) 



P,+P2+P 



where 



V, = eq^d'-^^^^^^r/^;q = l,...,T, (A.5) 

and Eq = Hu{jip{M,n,v),T)~ , £q+i = s V supja; < e,/2 : Hu{x,T) > 
4Huieq,J^)}; (7 = 0, ...,T, and T = mm{q : Sq < s}. Note that eo < v^^^ 
by construction. Furthermore, by (|A.3p and Lemma 3.1 of ([l|), 

hy = E^9-i(^^^^)^/^ < 7^^(V4,.^/^) < OM/S. (A.6) 



We now proceed to bound $P_1$-$P_3$ separately.
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On Cn = (sup^Pjv/i' < 64w), al = Pat^/ - Pn^o? < ^ivK/)' < 64w, 
by Massart's inequality, cf., (jii), Lemma 2.14.19, P|'^(|(P„,jv--Pjv)(7I'o/i)| > (1- 
f )M) < 2 exp(-n(l - 9 / A)^ / {2a%)) < 2 exp(-(l - e/A)^ilj{M, n, i;)). By the 
choice of £0, Pi < 2exp(iJc/(eo,^))exp(-(l - 6'/4)V(M,n,'(;)) < 2exp(-(l- 
0)ip{M, n, v)). On C^r, it follows from Lemma 33 of {11) that P* (C]v) is bomided 
by P*(sup^(PAr/i2)i/2 > 8„i/2) < 4cxp(-iVw + Hu{v^/^,T)) < 4exp(-2™ + 
^iP{M,n,v)) < Aexp{-{l-0)^{M,n,v)) 

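For $P_2$, if $\varepsilon_0 \le s$, let $\varepsilon_T = \varepsilon_0$. Then $P_2 = 0$. Otherwise, consider the case of $\varepsilon_0 > s$. Note that $P_N(\pi_q h - \pi_{q-1}h)^2 \le 2(P_N(\pi_q h - h)^2 + P_N(h - \pi_{q-1}h)^2) \le 2\varepsilon_q^2 + 2\varepsilon_{q-1}^2 \le 4\varepsilon_{q-1}^2$. By Massart's inequality, $P^*_{|N}(|(P_{n,N} - P_N)(\pi_q h - \pi_{q-1}h)| > \eta_q) \le 2\exp(-n\eta_q^2/(2\sigma_N^2))$ with $\sigma_N^2 \le P_N(\pi_q h - \pi_{q-1}h)^2 \le 4\varepsilon_{q-1}^2$, and by the choice of $\eta_q$, $q = 1, \ldots, T$,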

N 

P2 < ^|.F,|2supP|V(|(P„.JV-Pjv)K/l-^g-l/i)| >?79) 
■7=1 ^ 

T 2 T 

< 2^cxp(2i/a(£g,^) - = 2^exp((2- 2/0)ifc;(£„.F)) 

g— 1 9"! q—1 

00 

< 2 ^ exp((2 - 2/0)4'?iJ[;(£o, ^)) < 4exp(-(l - e)i^{M, n, v)). 

9=1 

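For $P_3$, note that $P_{n,N}f \le 2P_N f$ for any $f \ge 0$, and $P_N(\pi_T h - h)^2 \le \varepsilon_T^2$ by the definition of $\pi_T$. Then $|(P_{n,N} - P_N)(\pi_T h - h)|^2 \le 2(P_{n,N} + P_N)(\pi_T h - h)^2 \le 6\varepsilon_T^2 \le (\theta M/8)^2$ because $\varepsilon_T \le s$. So $P_3 = 0$.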

Now

$$P^*_{|N}\Big(\sup_{\mathcal{F}} |P_{n,N}h - P_N h| > M\Big) \le P_1 + P_2 + P_3 \le 6\exp(-(1 - \theta)\psi(M, n, v)).$$

After taking the expectation with respect to $Z_1, \ldots, Z_N$, we have, from (A.4), that $P^*(\sup_{\mathcal{F}} |P_n h - Ph| > 4M)$ is upper bounded by

$$10\Big(1 - \frac{1}{32\psi(M, n, v)}\Big)^{-1}\exp(-(1 - \theta)\psi(M, n, v)).$$

This completes the proof.

Proof of Lemma [2] : It can be verified that, with Xj > constants, 

02e-'' f e^^^-xydx + (1 - 6*2)6'' / e"^(-a;)^dx) ; 



1 
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Rv^ (a, b) can be expressed as 

A2 [Oi log(l + e-""-'')2;^dx + (1 - 61) log(l + e""+'')2;^dx+ 

/O M 
log(l + e~""-'')(-x)''dx + (1 - 02) y log(l + e""+'')(-x)Tdx 

Rv3{a,b) can be written, in the region of interest {{a,b) : a < 0,-1 < —(1 
b)a-^ < 0, < (1 - b)a-^ < 1}, as 

A3(0i((7 + 1)(7 + 2))'\1 - by+'a-'-' + (1 - + ^) + 

7+1 7+2 

e2(^ + ^) + (1 - e2)((7 + 1)(7 + 2))-^(i + by+\~''-'y, 
7+1 7+2 

Rvi{a, b) can be written as, when a > and 6 > 1, 

^n(7+i)(7+2)(^^ + ~ 

- (7 + l)(7 + 2) ^^ - —^^^ + ^^/(^ + ^0 ^ 
when a > and < 5 < 1, 



(7 + l)(7 + 2) aT+i aT+i ' ' 7 + I 

1 r ('^)^^% ^ 1-02 (l + b)^+^ 

(7+l)(7 + 2)^ a^+i ^+(7+l)(7 + 2) a^+i J' 

when a > and — 1 < < 0, 

^1 (1 - 6)^+2 ^ ^ ^ ^ ^ , 1 



^< f.+iK.+2) + - + 1) + 



(7 + l)(7 + 2) aT+i ' ' '7 + 1 

1 r^^^ 1-^2 . (1 + ^)^+' &^+^ 



(7+l)(7 + 2) (7+l)(7 + 2)' oT+i aT+i' 

when a > and 6 < — 1, 

- ^^)/(^ + 1) + ^^(TTT - (7 + l)(7 + 2) (^ - 

1-^2 . (i+fer+' _ b-<+-^ 

(7 + l)(7 + 2)^ a-r+i a7+i^^- 

Similarly i?(a, 6) = i((l-6ii+6l2) + (/[6 < 0](26li-l)+/[5 > 0](l-26l2))|6/a|''+i) 
when a > 0. The results can be verified through direct calculation. 
Proof of Lemma[3] : For , j = 1, • • • , 3, let /* = /^j . We verify Assumptions 
[XI and |B] through the exact expression of Vj given in the proof of Lemma [H 
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although Taylor's expansion is generally applicable. Note Rvj(a,b) is strictly 
convex and smooth; j = 1,2 and Rv^{a,h) is piecewise smooth and strictly 
convex in the neighborhood of (03,63). For any (a, 6) in the neighborhood of 
(a*, 6*) and some constant di > 0, ey. ) = RvAa,b) - RvAa*,b*) > 

di{a ^ a*,b — b*)HY.{a* ,b*){a — a*,b ~ b*)'^ with a positive definite matrix 
Hv,{a*,b*). Moreover, e(/, /^O R{a,b)-R{a*,b*) = {I[b < 0](26'i-l)+/[6 > 
0](l-26l2))(|6/a|T+i-|6*/a*p+i)/2. By the assumption that 6*1 +6*2 ^ I, b* ^ 0, 
< cil&j/oj — 6*/a*| for some constant c\ > 0, implying Assumption 
|A] with a = 1/2. For Assumption \B\ note that |/| < Ti for some constant 
Ti > wheney(/,/^0 is small. Then \V,if, z) ~V,if^^ , z)\ < V^{-T,)\f{x)- 
r^{x)\=V^{-Ti)\ia-a*)x+ib-b*)\. Hence VariV.if , Z) ~ V,{r^ , Z)) < 
C2E{{a - a*)X + [b - b*)f = {a-a*,b- b*)Dv^ {a-a*,b^ b*f with Dy^ a 
positive definite matrix, implying Assumption IB] with /3 = 1. 

For V4, the minimal of i?vi(a, b) is attained as a ^ 00 and 6 = 0, independent 
of 01, 6*2 and 7. Note that f^" = / since EViif, Z) = MfeJ^ EV^if, Z). Direct 
calculation yields that ev{f,f) — Rv4{a,b) — linia^oo Rvi{o-,0) > c^lb/al"^^^ , 
and |e(/,/)| < C5|6/a|''+^. This implies Assumption lAl with a = 1. For As- 
sumption [b1 it follows from the fact that V4{f,z) ~ L{f,z) for any z that 
VariV^if, Z) - V,if, Z)) < 2E\Vi{f, Z) - V^if, Z)\ < 2E\L{f , Z) - L{f, Z)\ + 
2E{Vi{f, Z)-L{f, Z)). Furthermore, E{L{f, Z)-L{f, Z)) = E\2P{Y_ = l\X = 
x) - l||i(/, Z) ~ L{f, Z)\ > min(|20i - 1|, \202 - l\)E\Lif, Z) - L{f, Z% and 
E{V4{f,Z) - L{f,Z)) < E{V4{f,Z) - L{f,Z)) = ey, (/,/). Therefore, As- 
sumption IbI is met with /3 = 1. This completes the proof. 

Proof of Lemma |4j For Assumption |^ note that for any w = w* + Aw in 
a small neighborhood of w* , ey(/, /^) = Rv{w) — Rv{w*) > d2Aw^ HiAw. 
Furthermore, direct computation yields that e{f,f^) < d3(Ai(;-^Ai(;)('''+^^/^, 
for some constant ^3 > 0. Hence, for some constant C2 > 0, |e(/,/^)| < 
C2ev(/, for all small Aw, implying AssumptionlAlwith a — (7+l)/2. 

For AssumptionE it follows from the fact \V{f, z)-V{f^,z)\ < ^^^^ |/c(a;)- 
fYix)\ that Var{V{f, Z) - V{f^,Z)) is upper bounded by 

E{j2 \fcix) - fYmir < mjz^ux) ~ f^{x)f) = aw^h^aw 

c=l c=l 

with H2 a positive definite matrix, implying that Var{V{f, Z) — V{f^ ,Z)) < 
c^evif, f^) for some constant C3 > and all small Aw, and thus Assumption 
[B]with /3 = 1. This completes the proof. 

Proof of Lemma 5: We use a pointwise argument. First consider $V = V_{svm1}$. Note that $EV(f, Z) = E\big(\sum_{c=1}^{3} p_c(X) h_{svm1}(u(f(X), c))\big)$. Now define $h_p(f)$ to be $\sum_{c=1}^{3} p_c V_c(f)$ for any $f \in R^3$, where $V_c(f) = \sum_{j \ne c}(1 - (f_c - f_j))_+$. We now verify that $\bar f = (2/3, -1/3, -1/3)$ minimizes $h_p(f)$ when $p = (5/11, 3/11, 3/11)$, for $x \in [0, 1/3]$. The other two cases when $x \in (1/3, 2/3]$ and $x \in (2/3, 1]$ can be dealt with similarly. Now reparametrize $f$ as $\bar f + (r_1, r_2, r_3)^T d$ with $\sum_c r_c = 0$, $\sum_c |r_c| = 1$ and $d = \|f - \bar f\|_1$. When $d$ is sufficiently small and $p = (5/11, 3/11, 3/11)$,

$$\begin{aligned}
h_p(f) - h_p(\bar f) &= p_1\big((r_2 - r_1)_+ d + (r_3 - r_1)_+ d\big) + p_2\big(2 + (r_1 - r_2)d + 1 + (r_3 - r_2)d\big) \\
&\quad + p_3\big(2 + (r_1 - r_3)d + 1 + (r_2 - r_3)d\big) - (3p_2 + 3p_3) \\
&= \big(p_1(r_2 - r_1)_+ + p_1(r_3 - r_1)_+ + p_2(r_1 - r_2) + p_2(r_1 - r_3)\big)d \\
&\ge C_1 d,
\end{aligned}$$

where $C_1 = \min_{\sum_c r_c = 0,\ \sum_c |r_c| = 1}\big(p_1(r_2 - r_1)_+ + p_1(r_3 - r_1)_+ + p_2(r_1 - r_2) + p_2(r_1 - r_3)\big) > 0$. By convexity of $h_p(f)$, $\bar f$ is the minimizer. Combining the three cases, we obtain with $\bar f(x) = (I[0 \le x \le 1/3] - 1/3,\ I[1/3 < x \le 2/3] - 1/3,\ I[2/3 < x \le 1] - 1/3)$ that $\bar f(x)$ minimizes $h_{p(x)}(f(x))$ for each $x$, thus
EV{f,Z) = Ehp{^x){f {X)) . For V = V^, it follows from an argument similar 
to the proof of Theorem 3.2 of (flih . This completes the proof. 
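
The pointwise step for the hinge loss can be checked numerically. The sketch below is only a sanity check under the reading that $V_c(f) = \sum_{j \ne c}(1 - (f_c - f_j))_+$ and $h_p(f) = \sum_c p_c V_c(f)$; the function names, the perturbation size and the number of random directions are assumptions of the sketch. It perturbs $(2/3, -1/3, -1/3)$ along random sum-to-zero directions of unit $L_1$ norm and records the smallest observed slope, which plays the role of the constant $C_1$.

```python
import random

P = (5 / 11, 3 / 11, 3 / 11)
F_BAR = (2 / 3, -1 / 3, -1 / 3)

def v_c(f, c):
    """Generalized hinge term for class c: sum over j != c of (1 - (f_c - f_j))_+."""
    return sum(max(0.0, 1.0 - (f[c] - f[j])) for j in range(3) if j != c)

def h_p(f, p=P):
    """Conditional risk h_p(f) = sum_c p_c * V_c(f)."""
    return sum(p[c] * v_c(f, c) for c in range(3))

def random_direction(rng):
    """Random r with sum(r) = 0 and sum(|r|) = 1."""
    r = [rng.gauss(0.0, 1.0) for _ in range(3)]
    mean = sum(r) / 3.0
    r = [ri - mean for ri in r]
    scale = sum(abs(ri) for ri in r)
    return [ri / scale for ri in r]

def min_slope(trials=2000, d=1e-3, seed=0):
    """Smallest observed (h_p(f_bar + d*r) - h_p(f_bar)) / d over random r."""
    rng = random.Random(seed)
    base = h_p(F_BAR)
    worst = float("inf")
    for _ in range(trials):
        r = random_direction(rng)
        f = [F_BAR[c] + d * r[c] for c in range(3)]
        worst = min(worst, (h_p(f) - base) / d)
    return worst

print(h_p(F_BAR))    # equals 3*p_2 + 3*p_3 = 18/11 up to rounding
print(min_slope())   # strictly positive if F_BAR is a strict local minimizer
```

A positive smallest slope is consistent with the linear growth $h_p(f) - h_p(\bar f) \ge C_1 d$ used in the proof; the check does not replace the convexity argument.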
Proof of Lemma 6: We will apply an argument similar to that in the proof of Lemma 5. Let $h_p^T(f)$ be $\sum_{c=1}^{3} p_c V_c^T(f)$ with $V_c^T(f) = T \wedge V_c(f)$ and $V_c(f)$ defined in the proof of Lemma 5. Note that $f^* = \bar f$ and $e_{V^T}(f, f^*) = E(E(V^T(f, Z) - V(f^*, Z)|X)) = E(h^T_{p(X)}(f(X)) - h_{p(X)}(f^*(X)))$. By Theorem 3.1 of Liu and Shen (2006), $e(f, f^*) = E(\max_c p_c(X) - p_{\arg\max_c f_c(X)}(X))$. It then suffices to show that $h_p^T(f) - h_p(f^*) \ge C(\max_c p_c - p_{\arg\max_c f_c})$ for any measurable $f$ and some constant $C > 0$. Suppose $p = (5/11, 3/11, 3/11)$ without loss of generality. Then $p_1 = \max_c p_c$, and the proof becomes trivial when $f_1 = \max_c f_c$. Suppose $f_1 < \max_c f_c$ and further $f_1 < f_2$ without loss of generality. Then $\|f - f^*\|_1 \ge |f_1 - f_1^*| + |f_2 - f_2^*| \ge |(f_1^* - f_2^*) + (f_2 - f_1)| \ge 1$ ($f_1^* = 2/3$ and $f_2^* = -1/3$). Two cases are treated separately. When $\max_c V_c(f) \le T$ or $V_c^T(f) = V_c(f)$, $h_p^T(f) - h_p(f^*) \ge C_1\|f - f^*\|_1 \ge C_1$ by the proof of Lemma 5. When $\max_c V_c(f) > T$ or $\max_c V_c^T(f) = T$, $h_p^T(f) \ge (\min_c p_c)(\sum_{c=1}^{3} V_c^T(f)) \ge (3/11)T$, and $h_p(f^*) = 18/11$. Then, $h_p^T(f) - h_p(f^*) \ge 3T/11 - 18/11 \ge 9/11 \ge \max_c p_c - p_{\arg\max_c f_c} = 2/11$. The desired result follows.

Proof of Lemma 7: The proof uses the pointwise argument and is similar to that of Lemma 6. Note that $E(V^T(f, Z) - V(f^*, Z))^2 \le TE|V^T(f, Z) - V(f^*, Z)| = TE\big(p_1(X)|V_1^T(f(X)) - V_1(f^*(X))| + p_2(X)|V_2^T(f(X)) - V_2(f^*(X))| + p_3(X)|V_3^T(f(X)) - V_3(f^*(X))|\big)$, with $V_c(f)$ defined in the proof of Lemma 5. It suffices to show that for any measurable $f$, $\mathrm{Left} \le C\,\mathrm{Right}$ for some constant $C > 0$ with $\mathrm{Left} = p_1|V_1^T(f) - V_1(f^*)| + p_2|V_2^T(f) - V_2(f^*)| + p_3|V_3^T(f) - V_3(f^*)|$ and $\mathrm{Right} = p_1(V_1^T(f) - V_1(f^*)) + p_2(V_2^T(f) - V_2(f^*)) + p_3(V_3^T(f) - V_3(f^*))$. Two cases are examined.

(1) If $M(x) = \max_{1 \le c \le 3} V_c(f) > T$, then $\max_{1 \le c \le 3} V_c^T(f) = T$. It follows that $\mathrm{Left} \le T$ and $\mathrm{Right} \ge 3T/11 - 18/11$, as shown in the proof of Lemma 6. This implies that $\mathrm{Left} \le 11\,\mathrm{Right}$ because $T \ge 9$.

(2) If $M(x) \le T$, we prove the non-truncated version of the inequality. Note that $|V_c(f) - V_c(f^*)| \le \|f - f^*\|_1$, $c = 1, 2, 3$. Then $\mathrm{Left} \le \|f - f^*\|_1$ for any $f \in R^3$. Following the proof of Lemma 6, we have $\mathrm{Right} \ge C_1\|f - f^*\|_1 \ge C_1\,\mathrm{Left}$.
Proof of Lemma 8: Note that $H_U(2s\varepsilon, \mathcal{G}_p(s)) = H_U(\varepsilon, \mathcal{G}_p^*(s))$ with $\mathcal{G}_p^*(s) = \{(2s)^{-1}f : f \in \mathcal{G}_p(s)\}$. In addition, $\mathcal{G}_p^*(s)$ is the convex hull of $\pm x_j/2$, $j = 1, \ldots, p$. Then $H_U(k^{-1/2}, \mathcal{G}_p^*(s)) \le \log\binom{2p + k - 1}{k} \le \log((2p + k)!) - \log((2p)!) - \log(k!)$ for any integer $k \ge 1$ using the argument in the proof of Lemma 2.6.11 of [25]. By Stirling's formula,

$$\sqrt{2\pi}\, n^{n+1/2} e^{-n + 1/(12n+1)} < n! < \sqrt{2\pi}\, n^{n+1/2} e^{-n + 1/(12n)},$$

implying that $H_U(k^{-1/2}, \mathcal{G}_p^*(s))$ is no greater than $2p\log(1 + k/(2p)) + k\log(2p/k + 1) - \log\sqrt{2\pi} + (12(2p + k))^{-1} - (12(2p) + 1)^{-1} - (12k + 1)^{-1}$. Let $\varepsilon \ge k^{-1/2}$. Then there exists a constant $c > 0$ such that $H_U(\varepsilon, \mathcal{G}_p^*(s)) \le c\big(p\log(1 + \tfrac{1}{p\varepsilon^2}) + \varepsilon^{-2}\log(p\varepsilon^2 + 1)\big)$. This completes the proof.
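
The leading terms of the Stirling step can be checked numerically. The sketch below compares $\log\binom{2p+k}{k}$, computed through the log-gamma function, with the two dominant terms $2p\log(1 + k/(2p)) + k\log(2p/k + 1)$. The function names and the grid of $(p, k)$ values are assumptions of the sketch, and the lower-order correction terms of the proof are deliberately left out.

```python
import math

def log_binom(n, k):
    """log of the binomial coefficient C(n, k), computed via log-gamma."""
    return math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)

def leading_terms(p, k):
    """Dominant part of the entropy bound: 2p*log(1 + k/(2p)) + k*log(2p/k + 1)."""
    return 2 * p * math.log(1 + k / (2 * p)) + k * math.log(2 * p / k + 1)

def bound_holds(p, k):
    """Check log C(2p + k, k) <= leading_terms(p, k) for one pair (p, k)."""
    return log_binom(2 * p + k, k) <= leading_terms(p, k)

# Scan a small grid of dimensions p and net sizes k.
all_ok = all(bound_holds(p, k) for p in range(1, 21) for k in range(1, 51))
print(all_ok)
print(log_binom(3, 1), leading_terms(1, 1))
```

Since $k$ plays the role of $\varepsilon^{-2}$, the two terms correspond to $p\log(1 + 1/(p\varepsilon^2))$ and $\varepsilon^{-2}\log(p\varepsilon^2 + 1)$ up to constants.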